
Creating Synthetic Data in Excel

To evaluate new methods and to diagnose problems with modeling processes, we often need to generate synthetic data. This allows us to precisely control the data going into our modeling methods and then check the output to see if it is as expected. In this lab, you'll use Excel to create point and raster data sets for use in trend surface and interpolation analysis.

1. Creating Random Point Data

In Excel, create two columns, one labeled X and the other Y. Insert the following formula into the first cell below the "X" heading:

=RANDBETWEEN(-1000,1000)

When you hit return, you should see a random value between -1000 and 1000 appear. Click on the cell and drag the fill handle (the small square in the corner) down about 30 rows. Then, grab the fill handle again and drag it over to fill the "Y" column. Save this file as a "CSV" file with a good name like "RandomPoints.csv". Then, load the file into ArcGIS using the "File -> Add Data -> Add X,Y Data..." option.

Note: The RANDBETWEEN function does not create truly random numbers because computers are deterministic machines; it produces pseudo-random numbers. However, for our purposes, these numbers will be just fine.

Question 1: Does the data look clustered, random, or dispersed to you?

This approach can be used to create random point data sets for testing clustering and other methods that do not require a measured variable.

2. Creating Random Points with Random, Gradient, and Uniform Values

Return to Excel and add three additional columns to your file named "Random", "Uniform", and "Gradient". Fill the "Random" column with random values as you did for X and Y. Set the entire "Uniform" column to 10. Then, set the "Gradient" column to match the X column. Save the file under a new name and load it into ArcGIS. One possible way to set up these columns is sketched below.
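
For example, assuming your X and Y values are in columns A and B and the new columns start in row 2 (cell references here are just one possible layout), the entries for the first data row could be:

=RANDBETWEEN(-1000,1000)     (Random)
=10                          (Uniform, or just type 10)
=A2                          (Gradient)

After entering these in row 2, drag the fill handle down so each column matches the length of your X and Y columns.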

Symbolize each of the columns in ArcGIS and take note of what you see.

Question 2: What are the Moran's I values for each of the three attributes? Why do they show these values?

3. Getting Control of Polynomials

There is a large area of modeling that uses polynomial expressions to model phenomena. As a review of polynomials, remember that the equation for a line is:

y=mx+b

Where m is the slope of the line and b is the intercept. When we are doing regression, "b" represents the value of y when the covariate x is 0. The "m" is then the relationship between x and y. Another way to say this is: if "m" is small, then y changes little as x changes; if "m" is large, then y changes a lot as x changes. Since the exponent on "x" is one, this is referred to as a "first order" polynomial.
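
For example, if m = 2 and b = 5, then x = 3 gives y = 2*3 + 5 = 11; doubling m to 4 gives y = 17 at the same x, while changing b simply shifts the whole line up or down by the same amount.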

Note: m and b are the "parameters", also called "coefficients" for this model.

Add three more columns labeled "first", "second", and "third". Enter values for x (your existing X column will work). Enter the equation of a line in the "first" column. Add cells off to the right of the spreadsheet to hold the "m" and "b" values, using the "$" operator to keep those cell references from changing as you fill down. For example, with x in column A, "m" in cell M2, and "b" in cell M3:

=A2*$M$2+$M$3

Create a scatter graph of x vs. your "first" order equation. Change the values of "m" and see if the line responds as you expect.

This is fine and has some applications, but most natural phenomena do not respond linearly. We can add additional parameters to create "higher order" equations that are more complex and interesting. The standard form for a second order polynomial is:

y=a+x*b+x^2*c

Where ^2 represents raising x to the second power, or x*x. Enter this equation in the column labeled "second" in Excel, try different values of a, b, and c, and graph the results. A possible Excel version of the formula is sketched below.
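
For instance, assuming x is in column A and you keep a, b, and c in cells N2, N3, and N4 (any free cells will do), the formula would be:

=$N$2 + A2*$N$3 + A2^2*$N$4

Fill this down the "second" column; the whole curve will then update whenever you change a, b, or c.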

You may find that it is challenging to get anything other than a straight line or a single exponential curve. To see something more interesting, you'll need to think about what is happening with each piece of the equation.

Question 3: What effect does changing the "b" value have?

Question 4: Make the "c" value negative. What effect does it have now?

To get an interesting curve, you'll need to make the "c" value much smaller than the "b" value (start with about 1/10 the size and keep decreasing it). This is because squaring the "x" value makes that term grow very rapidly and dominate the y values.

Try the "third order" polynomial below in the column labeled "third".

y=a+x*b+x^2*c+x^3*d
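
Using the same kind of layout as before (x in column A, and a through d assumed to be in cells N2 through N5), the Excel version would be:

=$N$2 + A2*$N$3 + A2^2*$N$4 + A2^3*$N$5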

Create another scattergram and experiment with the coefficients. You'll find you need to again decrease the value of "d" and reverse its sign relative to "c" to create an interesting curve.

Note: The most important learning here is how challenging it is to have polynomials represent complex phenomena. Polynomials have their place but they are challenging to work with and typically do not respond in the way that natural spatial phenomena do. Over the next weeks, we'll be learning other techniques that use different mathematics to create spatial models.

Note that you can add additional covariates to a polynomial very easily. The general form for a multivariate linear (first order) equation is then:

y=x1*B1+x2*B2+x3*B3+B

Where x1, x2, and x3 are different covariates, and B1, B2, and B3 are the slope values ("m" from above) that determine how y responds to each of them.
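
As a sketch in Excel, assuming three covariate columns in, say, E, F, and G, and the coefficients B1, B2, B3, and B kept in cells P2 through P5 (again, just one possible layout), the formula would look like:

=E2*$P$2+F2*$P$3+G2*$P$4+$P$5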

4. Evaluating a Trend Surface

Save your spreadsheet as an "Excel (xls)" file and then copy and paste just the values into a new spreadsheet. Save this file as a CSV. Load your CSV file into ArcGIS and export it to a shapefile. Select "Geostatistical Analyst -> Explore Data -> Trend Analysis". This dialog attempts to fit a polynomial regression through your data. Select different attributes and see if the shape of each curve is what you would expect.

Now, run the "Trend" tool (it's easiest to search for it). This tool will attempt to create a raster that matches the overall trend in your data. Try different values for the "Polynomial Order" against your "first", "second", and "third" order attributes.

Note: ArcGIS always defaults to a categorical symbology for continuous rasters. This can be misleading. I recommend you change the symbology to stretched for continuous rasters.

Question 5: What are the coefficients and RMS error for each surface? (See the Trend tool's help for information on the file output.)

Question 6: Are these values what you expected?

By the way, you just "modeled" a trend within your data! The coefficients represent the parameters for the model.

We now want to remove the trend from our data to see how much of the variance in the data is explained by the trend. You can do this by adding a new attribute to your point file, using "extract by point" to add the value from your trend surface to the point attributes, and then running the Field Calculator to find the difference between the original point values and the trend surface values. The result will be your residuals.

Create histograms for the original values, your trend surface values, and your residuals.

Question 7: How good a job did the trend surface tool do at removing the trend in your data?

5. Creating Data with Autocorrelation

Another phenomenon in the real world is that things that are closer together tend to be more alike. This can be because of a trend driven by another phenomenon, or because trees and other species tend to spread their seeds nearby rather than far away. After we remove any trends, we want to understand whether there is any autocorrelation in the data.

Note: Autocorrelation is often a trend that has yet to be discovered.

The gradient dataset from above is highly autocorrelated, but it is also an easy trend to detect. Trigonometric functions (sine and cosine) can be used to create patterns of values that change spatially over a grid. Try entering the following as an equation for a new column in your point spreadsheet in Excel:

=SIN(A2/500)*COS(B2/500)

Bring this data into ArcMap, examine it visually, and then run Moran's I. Changing the value "500" will change the resolution of the grid pattern; see the example below.
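
For instance, a variation like this (still assuming X and Y are in columns A and B) produces a finer pattern, with more peaks and troughs across the same extent:

=SIN(A2/100)*COS(B2/100)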

Question 8: What is the value of Moran's I?

6. Creating Complex Point Data

Now, add another column with random data to your spreadsheet in Excel. Then create another column and add your third order polynomial, the autocorrelated data, and your random data together in the new column. Plot this against "x" to see the result. Now, add a coefficient to each term to control how much each column contributes to your final data set, as sketched below.
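
As a sketch, assuming your third order column, your sine/cosine column, and your new random column are in columns H, I, and J, and you keep the three weights in cells Q2, Q3, and Q4 (all of this is just one possible layout), the combined column could be:

=H2*$Q$2 + I2*$Q$3 + J2*$Q$4

Because the sine/cosine values only range from -1 to 1 while the polynomial and random values can be in the hundreds or thousands, you will likely need a much larger weight on the sine/cosine term for it to show up in the result.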

This task allows you to create relatively complex point data in a very controlled and well understood way. This will be very valuable for creating test data sets to evaluate tools and even for developing new approaches with confidence.

Note that the process you have just completed in one lab is very similar to most modeling tasks. You have point data with measured values and you are trying to find trends, autocorrelation, and random noise in the data. The trends typically come from covariates, while autocorrelation represents interactions between the topic of interest and itself (and possibly other things). Also, all the data we have contains some noise, and the world is pretty noisy, so we have to expect a good bit of noise to show up.

7. Putting It All Together

Your task is to create a point data set with at least 30 points that includes at least a second order polynomial trend in one direction. The data set should also include some autocorrelation that is not part of a trend surface, and some random noise. Then, use a trend surface to evaluate the data set and remove the trend. Then, find the residuals and run the Geostatistical Wizard to determine if there is autocorrelation and create an interpolated surface. Finally, find the remaining random component of the data. A sketch of one way to generate such a data set in Excel follows.
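
If it helps to get started, here is one possible generating formula, assuming X is in column A and Y in column B, and treating every coefficient as a placeholder for you to tune. It combines a second order trend in X, a sine/cosine pattern for autocorrelation, and random noise:

=A2*0.5 + A2^2*0.001 + SIN(A2/300)*COS(B2/300)*200 + RANDBETWEEN(-100,100)

The exact numbers do not matter; what matters is that you know how much trend, autocorrelation, and noise went in, so you can check whether the tools recover each piece. Remember to copy and paste the values before exporting, as in section 4, so the random component stops recalculating.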

You have probably run interpolation before, and hopefully you were taught to check for autocorrelation beforehand. Before doing interpolation, we should really check for overall trends in the data, then check for autocorrelation, and then perform interpolation on the values that are left over (the residuals). If you are not familiar with the Geostatistical Wizard, refer to the Interpolation lab in GSP 470 for instruction.

Turn in

Turn in a short report on what you found when you created the data set from section 7, above. You do not have to create final maps for this report; screenshots are fine.

Extra Credit: Run the trend surface and autocorrelation analyses with smaller and smaller numbers of points. What happens to the results?